Molecular Plant 


Research article 


Q CelPress 


Partner Journal 


Macroalgal deep genomics illuminate multiple 
paths to aquatic, photosynthetic multicellularity 


David R. Nelson':^:9^, Alexandra Mystikou ^? 5^, Ashish Jaiswal', Cecilia Rad-Menendez^, 
Michael J. Preston®, Frederik De Boever^, Diana C. El Assal', Sarah Daakour'-, 
Michael W. Lomas®, Jean-Claude Twizere ^6, David H. Green‘, William C. Ratcliff’ 


and Kourosh Salehi-Ashtiani ^:* 

‘Division of Science and Math, New York University Abu Dhabi, Abu Dhabi, UAE 

?Center for Genomics and Systems Biology (CGSB), New York University Abu Dhabi, Abu Dhabi, UAE 
Biotechnology Research Center, Technology Innovation Institute, PO Box 9639, Masdar City, Abu Dhabi, UAE 
^Culture Collection of Algae and Protozoa, Scottish Association for Marine Science, Oban, Scotland, UK 

5National Center for Marine Algae and Microbiota, Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA 
$L aboratory of Viral Interactomes, GIGA Institute, University of Liege, Liege, Belgium 

"School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA 

8These authors contributed equally to this article. 

*Correspondence: David R. Nelson (drn2@nyu.edu), Alexandra Mystikou (alexandra.mystikou@tii.ae), Kourosh Salehi-Ashtiani (ksa3@nyu.edu) 
https://doi.org/10.1016/j.molp.2024.03.01 1 


ABSTRACT 


Macroalgae are multicellular, aquatic autotrophs that play vital roles in global climate maintenance and 
have diverse applications in biotechnology and eco-engineering, which are directly linked to their multicel- 
lularity phenotypes. However, their genomic diversity and the evolutionary mechanisms underlying multi- 
cellularity in these organisms remain uncharacterized. In this study, we sequenced 110 macroalgal 
genomes from diverse climates and phyla, and identified key genomic features that distinguish them 
from their microalgal relatives. Genes for cell adhesion, extracellular matrix formation, cell polarity, trans- 
port, and cell differentiation distinguish macroalgae from microalgae across all three major phyla, consti- 
tuting conserved and unique gene sets supporting multicellular processes. Adhesome genes show 
phylum- and climate-specific expansions that may facilitate niche adaptation. Collectively, our study re- 
veals genetic determinants of convergent and divergent evolutionary trajectories that have shaped 
morphological diversity in macroalgae and provides genome-wide frameworks to understand photosyn- 
thetic multicellular evolution in aquatic environments. 
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Shifts to simpler multicellularity from unicellularity (e.g., in 
Volvox carterii) (Prochnik et al., 2010) may not show significant 
changes in gene content, instead resulting primarily from 
differential regulation of gene expression (Matt and Umen, 
2018) and splicing patterns (Balasubramanian et al., 2023). 
Indeed, shifts to multicellularity can be observed experimentally 
with little or no change in basal genetic content (Ratcliff et al., 
2012), and some lineages have reverted to unicellular forms 
after evolutionary periods of multicellularity. Thus, transitions to 


INTRODUCTION 


Macroalgae encompass a diverse array of organisms belonging 
to three distinct phyla, each of which has independently evolved 
multicellularity. These organisms demonstrate extraordinary 
morphological diversity, ranging from millimeter-scale filaments 
to colossal 65-m-long kelps, and inhabit a variety of ecological 
niches. The simpler form of multicellularity, filamentous growth, 
is common in macroalgae and parallels that seen in other simpler 
multicellular phototroph lineages, such as Zygnematophyceae 
(Hess et al., 2022). While many algal lineages have adopted 


simple multicellularity, complex phototrophic multicellularity is a 
defining feature of plants and macroalgae (Ad! et al., 2012). 


Published by the Molecular Plant Shanghai Editorial Office in association with 
Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and CEMPS, CAS. 


Molecular Plant 17, 747-771, May 6 2024 © 2024 The Author. 747 


This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). 


Molecular Plant 


complex multicellularity are more likely to involve newly acquired, 
lineage-defining gene sets, which can be discovered through 
extensive genome sequencing. 


The three principal phyla (Adl et al., 2019) of marine macroalgae — 
Rhodophyta (Brawley et al., 2017) (red algae), Chlorophyta (He 
et al., 2021) (green algae), and Ochrophyta (Bringloe et al., 
2020) (brown algae)—exemplify the convergent evolution of 
complex multicellularity in aquatic phototrophs. Studies on 
Volvox carteri, a simple multicellular green alga, shed light on 
the early stages of chlorophyte germ-soma differentiation, a 
key feature of multicellularity (Ferris et al., 2010; Prochnik et al., 
2010; Matt and Umen, 2016, 2018; Umen, 2020; Yamamoto 
et al., 2021; Balasubramanian et al., 2023). The comprehensive 
analysis of Volvox cell-type transcriptomes revealed distinct 
molecular and metabolic programming between these cell 
types (Matt and Umen, 2018). Nonetheless, while regulatory 
changes play a key role in the evolution of multicellularity in the 
volvocine green algae, no significant changes in gene content 
were found compared to its closest unicellular relatives. While 
increased sophistication in ECMs are hallmarks of many simple 
multicellular lineages, their higher complexity usually derives 
from gene regulatory changes and not acquisitions (Kloareg 
et al., 2021). 


Macroalgal lineages evolved specific metabolic and morpholog- 
ical traits underlying their multicellular phenotypes, which form 
independent bases for their evolution of multicellularity. Investi- 
gating the genetic foundations of multicellularity in these lineages 
is crucial for understanding how complex life adapts to varying 
environmental challenges and for gaining insights into the genetic 
underpinnings of this significant evolutionary transition. A scarcity 
of macroalgal genome sequences has thus far constrained the 
development of a comprehensive comparative analysis of macro- 
algal multicellularity. Increased genome sample sizes will facilitate 
the statistical resolution of hypothesis tests regarding phyla-wide 
gene gains and losses accompanying the rise to multicellularity in 
the three clades. In this study, we greatly expanded the number of 
available macroalgal genomes from 14 to 124, shedding new light 
on the genomic basis of macroalgal multicellularity. 


Rhodophyta and Chlorophyta algae belong to the Archaeplastida 
supergroup descending from the original plastid/nuclear symbi- 
otic event in eukaryotic algae (Yoon et al., 2004; Brawley et al., 
2017). In contrast, Ochrophyta contain Rhodophyta algal 
plastids derived from further endosymbiosis (Bringloe et al., 
2020). Ochrophyta evolution through this route may have 
involved multiple secondary endosymbioses, rather than a 
single event (Yoon et al., 2004). Macroalgae thrive in various 
conditions in all climates in marine, freshwater, and brackish 
environments. They can grow in intertidal and benthic regions 
as deep as 268 m (Littler et al., 1985). They represent a 
particularly powerful system for examining the environmental 
and genetic drivers of multicellular evolution, as independent 
transitions in these clades were separated by more than a billion 
years and occurred under fundamentally different environmental 
conditions (Yoon et al., 2004; Brawley et al., 2017). 


The evolution of algal lineages has been meticulously analyzed 
using molecular clock experiments and geological evidence 
(Heckman et al., 2001; Douzery et al., 2004; Hedges et al., 
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2004; Yoon et al., 2004; Berney and Pawlowski, 2006; 
Zimmer et al., 2007; Herron et al., 2009; Lang et al., 2010; 
Fiz-Palacios et al., 2011; Gueidan et al., 2011; Parfrey et al., 
2011; Blank, 2013; Gaya et al., 2015; Munakata et al., 2016; 
Yang et al, 2016; Caspermeyer, 2017). Ochrophyta and 
Rhodophyta diverged ~1.6 billion years ago (bya), and 
rhodophytes and chlorophytes split ~1.2 bya. Chlorophyta 
and rhodophyte macroalgae emerged more than 1 bya; in 
contrast, the stramenopile-origin macroalgal Phaeophyceae 
only emerged ~200 million years ago (mya) (Supplemental 
Figure 1). Phaeophyte seaweeds thus evolved in a period 
with comparably high atmospheric O2 and solar luminosity 
but low CO; (Supplemental Figure 1). The global prevalence 
of the recently emerged Phaeophyceae, evident in extensive 
giant kelp forests and abundant seaweed biomass, indicates 
they are well adapted to thrive in contemporary environmental 
conditions. For instance, Sargassum emerged only 6.7 mya 
(Yip et al, 2020), yet has swiftly colonized the world's 
oceans, generating significant biomass (Arellano-Verdejo and 
Lazcano-Hernandez, 2021; Liu et al., 2021; Song et al., 
2021; Stelling-Wood et al., 2021; Mulders et al., 2022). 
Invasive phaeophytes threaten various industries and 
ecosystems; understanding the genomic and evolutionary 
bases of their success will facilitate management efforts and 
biotechnological utilization (Charrier et al., 2017). 


Our large-scale project, ALG-ALL-CODE-macro, sequenced 110 
macroalgal genomes from diverse climates and phyla, offering 
new insights into macroalgal biology and the evolution of multi- 
cellularity. We identified key genes for processes necessary for 
multicellular phenotypes (e.g., cellular adhesion, transcription 
factors [TFs]). Many of these had viral origins and were unique 
to or conserved among the three primary three macroalgal line- 
ages. The diverse viral-origin sequences in various proteins un- 
derpin lineage-specific mechanisms that support the evolution 
of multicellular macroalgae with differentiated, coordinated tis- 
sues. Our study illuminates the convergent and divergent evolu- 
tionary trajectories shaping macroalgae's diverse morphologies, 
outlining a basis for understanding photosynthetic multicellular 
evolution in marine environments. 


RESULTS AND DISCUSSION 


Comparative genomics of macroalgae 


To understand the genomic principles underpinning the evolution 
of macroalgal multicellularity, we sequenced 110 new macroalgal 
genomes grown in international culture centers, of which 105 
are different species. All available macroalgal cultures from the 
Culture Collection of Algae and Protozoa (CCAP), Scotland, UK 
(www.ccap.ac.uk) and the National Center for Marine Algae 
and Microbiota (NCMA), Maine, USA (ncma.bigelow.org) were 
cultured for downstream genome sequencing. Five species have 
been sequenced twice due to their availability and presence 
in both culture collection centers in separate cultures: Bostrychia 
radicans  CCAP1357/2 and CCMP2677; Catenella nipae 
CCMP2850 and CCMP2848; Porphyra pulchella CCMP3230 and 
CCAP1379 3; Pylaiella littoralis CCMP1907 and CCAP1330/9; 
and Ulvella leptochaete CCAP6037/1 and CCAP6000/1. One spe- 
cies was sequenced thrice from three different cultures from both 
culture collection centers: Bostrychia moritziana CCAP1357/3, 
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Figure 1. Geographies, habitats, climates, and divergence estimates for phyla of the newly sequenced macroalgae. 
(A) Global distribution of the 110 macroalgal isolates whose DNA was extracted to reconstruct genomes for ALG-ALL-CODEmacro by phylum. 
(B) Flow diagram showing the native habitat and climate of sequenced macroalgae. The color scale shows the average temperature for all isolates 
represented by a given band. Distributions of taxa correlate with their availability in culture centers and roughly with their natural distributions, with 
Rhodophyta having many more extant species than the other clades. 

(legend continued on next page) 
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CCAP1357/8 and CCMP3247. The resequencing was done for 
benchmarking purposes (Supplemental Table 1) and to examine 
sequences unique to either culture accession. Species from 
diverse geographies (Figure 1B and 1C, Supplemental Table 1), 
including those from temperate, tropical, subtropical, polar, and 
subpolar climates, and from five major latitude parallels were 
cultured for genome sequencing. These species are from the 
three major macroalgal phyla that emerged epochs apart in 
vastly different geological conditions (Supplemental Figure 1). 


In addition to the 110 international macroalgal culture samples, 
nine macroalgal species from field samples we previously recov- 
ered from the UAE (https://doi.org/10.5281/zenodo.7758508) 
were included in our analyses. The expanded genome oollection 
presented here, from 14 to 124 genomes, helps to resolve long- 
standing questions regarding macroalgal biology and the evolu- 
tion of multicellularity in different geological and environmental 
time frames spanning epochs. Our dataset of 110 genomes 
represents a substantial improvement in both quantity and quality 
compared to previously sequenced macroalgal genomes 
(Supplemental Table 1 and Figure 1C-1E). After sequencing, 
genome assemblies were computationally decontaminated 
using a pipeline developed in house (Supplemental Figure 2, 
Supplemental Data 3). This pipeline is available for use in non- 
macroalgal, eukaryotic species and offers substantial improve- 
ments in accuracy and batch processing over other computa- 
tional decontamination methods. Furthermore, the separated 
contaminant sequences and their homologs in the National Cen- 
ter for Biotechnology Information (NCBI) non-redundant (nr) pro- 
tein database are also included in Supplemental Data 3. 


Asaninitial, high-level analysis, we analyzed the degree of shared 
vs. unique genes by examining the distribution of protein families 
(PFAMs) in the three lineages using ternary (Ponsen et al., 2009) 
graphing (Figure 2A) and response screening (Supplemental 
Table 2). Genes unique to each lineage could play important 
roles in the independent acquisition of multicellularity in the 
three lineages, while common PFAMs highlight possible evolu- 
tionary convergence (Figure 2). Batch t-tests were used to 
compare PFAMs by phyla. We found that 1065 PFAMs signifi- 
cantly differed between clades (adjusted [adj.] p «0.05), while 
9233 PFAMs were significantly equivalent among the three phyla 
(eq. p «0.05, Supplemental Table 2). These results produced 
similar findings to the ternary analysis (Figure 2, Supplemental 
Table 2). Protein families near the center of the ternary graph 
represent conserved functions, while peripheral PFAMs at the 
corners of the three lineages indicate lineage-specific inclusions. 
We extracted PFAM sets at incremental purity cutoffs in each of 
the three extremes in the ternary graph. Purity is defined by the 
degree which the PFAM is unique to any given lineage or present 
at higher copy numbers than the other two. Higher-purity sets 
represent specialized gene sets for each clade (Figure 2). Sets 
of filtered PFAMs (>80% purity) from the three lineages were 
used in hypergeometric distribution tests to analyze functional 
enrichment in domain-centric Gene Ontology (dcGO) (Fang and 
Gough, 2013) analyses (Figure 2). The PFAM domains yield 
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information about Gene Ontology (GO) enrichment in a domain- 
centric solution to functional enrichment analysis. Only the 
Ochrophyta and Chlorophyta showed enrichment in any GO 
term at >80% purity. Ochrophyta (n = 43 genomes) were enriched 
for 646 GO terms using the isolated PFAM set as dcGO 
queries (Figure 2). Ochrophyta-enriched GO terms at a PFAM pu- 
rity >80% included “multicellular organismal process" (GO: 
0032501, Z = 4.61, adj. p = 4.12 x 107^, “multicellular organism 
development" (GO: 0007275, Z = 4.18, adj. p = 9.35 x 107%), 
“regulation of multicellular organismal development" (GO: 
2000026, Z = 3.64, adj. p = 4.45 x 107), and “regulation of multi- 
cellular organismal process” (GO: 0051239, Z = 3.36, adj. 
p = 5.98 x 105). 


This relatively large set of GO terms for high-purity PFAMs in 
Phaeophyceae represents their specialization into specific gene 
sets facilitating multicellularity compared to the other two macro- 
algal clades. Lineages with large influxes of new genes, such as 
the rampant lysogenic virus genome insertions observed in Ecto- 
carpus, may have locked themselves into multicellular modes. In 
contrast, many other lineages have representatives that reverted 
from multicellularity to unicellularity. The reversion from multicel- 
lularity to unicellularity has apparently occurred in cyanobacteria, 
green algae, and filamentous fungi, while no examples of 
reversion have been seen in metazoans, land plants, or Phaeo- 
phyceae (Peter et al., 2023). Our data provide further evidence 
for the uniqueness of the genetic basis for multicellularity in the 
Phaeophyceae. 


Chlorophytes were enriched for 28 GO terms at >80% PFAM pu- 
rity, including “shoot system development” (Z = 9.76, adj. p = 
7.96 x 10°), “flower development” (Z = 8.85, adj. p = 1.97 x 
107^, “plant organ development” (Z = 8.21, adj. p = 1.31 x 
1075, “reproductive shoot system development" (Z = 8.23, adj. 
p = 3.49 x 107^, “response to jasmonic acid" (Z = 7.67, adj. 
p = 2.57 x 107°), and “positive regulation of G-protein coupled 
receptor protein signaling pathway” (Z = 5.93, adj. p = 9.72 x 
1075). PFAMs comprising these GO terms are implicated in mo- 
lecular processes specific to chlorophytes. For example, the 
GO term for phyllome development (enriched at Z = 8.27, adj. 
p = 6.17E-04, related to GO:0010432, bract development; 
GO:0048440, carpel development; GO:0048366, leaf develop- 
ment; GO:0048441, petal development, GO:0048442, sepal 
development; and GO:0048443, stamen development) is 
composed of 12 PFAMs and five of these were specific to chlor- 
ophyte macroalgae: a TF (PF00847; found in ethylene-inducible 
TFs) (Ohme-Takagi and Shinshi, 1995) a K-box domain 
(PF01486; found in SRF-type TFs, e.g., the Arabidopsis floral 
homeotic gene PISTILLATA) (Goto and Meyerowitz, 1994), a 
B3 DNA-binding domain (PF02362, e.g., ARF1, a TF that binds 
to auxin response elements) (Ulmasov et al, 1997), and 
SQUAMOSA-pROMOTER BINDING PROTEINS, which are plant 
TFs (Klein et al., 1996). 


Many processes enriched for in chlorophyte macroalgae were 
conserved in higher plants, indicating that a propensity to 


(C) Genome sizes for the newly sequenced and reference macroalgae. C, Chlorophyta; O, Ochrophyta; R, Rhodophyta. 


(D) Exon counts for the newly sequenced and reference macroalgae. 


(E) BUSCOs (Manni et al., 2021) (96 present = 1 — missing) from the newly sequenced and reference macroalgae. 
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Figure 2. Lineage-specific protein families in macroalgal clades form the basis for their independent paths to multicellularity. 
(A) Ternary plot constructed from PFAM count averages for each of Chlorophyta (n = 11), Rhodophyta (n = 74), and Ochrophyta (n = 45). Information about 


the degree of uniqueness of PFAMs (purity) is shown. Axes indicate th 


e purity of the PFAM for the phyla shown on the triangle points. The plot is 
(legend continued on next page) 
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develop these enriched functions may have been latent in the last 
common ancestor of Chlorophyta seaweeds and land plants 
(~1.2 bya) (Hedges et al., 2004; Yoon et al., 2004; Zimmer 
et al., 2007; Parfrey et al., 2011; Yang et al., 2016). Indeed, in 
some Chlorophyta lineages evolving multicellularity, the base 
gene set prerequisite appears to have been possessed in 
plena by their unicellular progenitors (Prochnik et al., 2010). 
Recent phylogenetic reconstructions have suggested that the 
green macroalgal Ulva clades are paraphyletic (Fucíková K, 
2014; Fang et al, 2018; Hou et al, 2022) thus, similar 
phenotypes arose independently from the fertile, proto- 
multicellular chlorophyte gene repertoire. A recent study found 
transporters (e.g., nuclear pore proteins), hormone signaling 
components, and the TALE superclass of homeobox genes 
were expanded in the green alga Caulerpa lentillefera in a manner 
similar to land plants (Arimoto et al., 2019). These findings concur 
with our data, showing that many chlorophyte lineages already 
had a gene set primed for the evolution of multicellularity 
through plant-like pathways. 


Phaeophyte algae's specialization may partly stem from their 
recent emergence. In comparison, Chlorophyta and rhodophyte 
multicellularity originated over a billion years ago (Brawley 
et al, 2017) (Supplemental Figure 1), potentially obscuring 
informative genomic signals in our ternary analysis, except for 
shared Chlorophyta/Rhodophyta overrepresented PFAMs 
compared to Ochrophyta (Fig 2A). These PFAMs include tandem 
arrays of viral-origin reverse transcriptase (RT), RNAse-like, and 
integrase domains, often found near centromeres in microalgae 
and plants (US Patent WO2012061481A2). The rationale for re- 
taining these virus-related elements remains uncertain (Sharma 
et al., 2013), but our findings indicate they may constitute 
unique genomic backbones for Archaeplastida macroalgae. 
Conversely, Ochrophyta and Chlorophyta genomes had the 
most ankyrin repeats, primarily acquired after Ochrophyta- 
Rhodophyta and Chlorophyta-Rhodophyta splits. Our prior 
research showed strong positive selection in marine microalgal 
ankyrins (Nelson et al, 2021), implying potential fitness 
advantages for marine macroalgae. Our data suggest that 
these extensive, lineage-specific duplications of potentially 
adaptive and often virally sourced genes comprise core genomic 
features for members of these phyla. 


Our data reveal that the recent emergence of multicellular Phaeo- 
phyceae is reflected in their genomes through domain- 
centric functional enrichment (Figure 2B). In other words, the 
genes uniquely responsible for the Phaeophyceae rise to 
multicellularity are contained in these modular PFAM 
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sets. Their genomic signal detailing their route to, and 
maintenance of, a multicellular lifestyle is much newer than 
rhodophytes and chlorophytes. This signal was seen in the 
PFAM datasets, wherein a ternary analysis was sufficient 
to identify Phaeophyceae genes comprising GO terms 
for multicellular developmental processes. The pervasive 
sequence insertions containing EsV-1-7 and FNIP motifs in 
Ochrophyta were not seen in Chlorophyta and Rhodophyta ge- 
nomes. Phaeophyceae EsV-1-7 proteins included a variety of 
domains with putative or established roles in multicellular organ- 
ismal development (Supplemental Table 5). These results further 
evidence that virus-sourced gene collections underpin the evolu- 
tion of brown algal multicellularity. 


Protein family loss and gain during macroalgal evolution 


Comparisons between macroalgal chlorophytes, phaeophytes, 
and rhodophtyes could yield insight into unique and shared 
genes in these clades, differentiating them from their microalgal 
(i.e., microscopic, unicellular) counterparts. Gene circuits regu- 
lating the onset of complex multicellular development are 
conserved in fungal lineages (Nagy et al., 2018); here, we 
tested for the presence of similar PFAMs found in such circuits 
that could differentiate macro-from microalgae. Intersection 
analysis of these PFAM sets revealed their shared and unique 
retention (Figure 3A). Rhodophyta microalgae had 29 unique 
PFAMs, while their macroalgal counterparts had 883. Microalgal 
chlorophytes had 696 unique PFAMs, whereas their macroalgal 
counterparts had 158 unique PFAMs. The contrasting unique 
PFAM contents between Rhodophyta and Chlorophyta macro-/ 
microalgae suggest convergence in Chlorophyta seaweeds and 
functional divergence in Rhodophyta macroalgae, potentially as 
a function of their evolutionary time spent as multicellular 
organisms. 


Nine PFAMs were unique to microalgae, regardless of lineage 
(Figure 3A). These included the Parvovirus non-structural protein 
NS1, the TruB. C tRNA Pseudouridine synthase II, the YojJ bac- 
terial membrane-spanning protein, an alcohol dehydrogenase 
GroES-associated domain, a SUMO isopeptidase, a Rrp7 
RRM-like domain, and two domains of unknown function 
(DUFs) (DUF2202 and DUF6787). The photosystem Il 12-kDa 
extrinsic protein (PsbU, PF06514) was also lost in all chlorophyte 
(096 retention rate [RR]) and most ochrophyte 1696 RR) macroal- 
gae, but most rhodophyte macroalgae retained PsbU (6996 RR; 
Supplemental Figure 3, Supplemental Data 3). In contrast, 
PsbP was lost in ochrophyte macroalgae (p = 5.68 x 10^ 1^) but 
retained in chlorophyte and rhodophyte macroalgae. 


barycentric; three variables, represented by the average copy number of the PFAM, sum to a constant; bubble positions indicate copy number ratios (i.e., 
purity). Bubble size represents PFAM count totals for all phyla, where two is the smallest size, and 1000 is the largest. 

(B) Selected enriched GO terms from PFAMs filtered at 80% purity (n = 646) in the Phaeophyceae (n = 45) with roles in maintaining a multicellular 
phenotype. PFAMSs for cell organization and migration had high representation, alluding to roles for determining morphology and tissue organization in 
these phaeophyte algal genes. Contrasting with Ochrophyta and Chlorophyta, Rhodophyta (n = 74) displayed no GO enrichment in PFAMs with over 8096 
purity, possibly due to their longer evolutionary timeline and diversified traits, resulting in a generalist PFAM profile. 

(C) PFAM domains of unknown function (DUFs) from each phylum as compared with their unicellular correlates using false discovery rate (FDR)-corrected 
batch t-tests in phyla-wide response screens. The x axis categories indicate whether a DUF was present at significantly different or equivalent levels or if 


statistical resolution was absent (inconclusive). 


(D and E) Differential retention of key PFAMs in the three lineages. Each PFAM description includes multiple distinct PFAM domains differentially retained 
in the three lineages (see Supplemental Table 1). Most of the PFAMs distinguishing macroalgae from microalgae were predicted to function in the (D) 
membrane or (E) nucleus, outlining structural, signaling, and fundamental cell programming strategies toward maintaining multicellularity. 
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Hypergeometric tests were used to determine functional enrich- 
ment from unique and shared PFAM sets distinguishing macroal- 
gae from their microalgal counterparts (Figure 3). The 222 
macroalgal-specific PFAMs conserved in all three lineages 
were significantly enriched for 59 GO terms. Enriched GO 
terms related to morphology, extracellular matrix (ECM) forma- 
tion, cell-cell adhesion, the establishment of cell polarity, and 
signaling were recovered (Figure 3C). For example, “binding” 
(GO:0005488, Z = 3.2, adj. p = 1.62 x 107°), “cell periphery” 
(GO:0071944, Z = 3.49, adj. p = 1.18 x 107%), “apical part of 
cell" (GO:0045177, Z = 3.88, adj. p = 2.15 x 107%), “cell 
differentiation" (GO:0030154, Z = 4, adj. p = 2.54 x 1072), 
“cytoskeleton organization" (GO:0007010, Z = 4.15, adj. p = 
2.68 x 10-7), “cell-cell adhesion” (GO:0098609, Z = 5.27, 
adj. p = 3.12 x 10), “cell-cell recognition” (GO:0009988, Z = 
9.11, adj. p = 2.54 x 10°), and “multicellular organismal 
process" (GO:0032501, Z = 3.64, adj. p = 2.54 x 10^?) were 
GO terms associated with the enriched macroalgal PFAMs (See 
Supplemental Note 1A for expanded technical and contextual 
descriptions of PFAMs differentiating micro-from macroalgae 
and their GO enrichment). The high degree of enrichment in these 
proteins in macroalgae from any clade compared to their micro- 
algal counterparts suggests they are associated with macroalgal 
multicellularity evolution. The macroalgal genes underlying the 
enrichment are responsible for many processes and functions 
known to be important for multicellularity in other lineages 
(Steinberg, 1975; Hedges et al., 2004; Ratcliff et al., 2012; Nagy 
et al., 2018; Strother et al., 2021). 


The RRs of PFAMs in macroalgae (n = 124) were evaluated in a 
phyla-wide manner with batch t-tests (i.e., response screens) 
by comparison with microalgae from the corresponding clades 
(n = 125). We found 1059 instances of significantly different 
RRs (adj. p «0.05): chlorophytes had 10 losses and 61 gains, 
ochrophytes had 507 losses and 117 gains, and rhodophytes 
had 53 losses and 311 gains. There was a strong positive corre- 
lation between RR and clade age. Compared to microalgae, 
Chlorophyta and Rhodophyta macroalgae have expansions in a 
wide variety of gene families, indicating functional divergence af- 
ter the development of multicellularity (Figure 3B). In contrast, 


Figure 3. Shared and unique PFAMs in micro- and macroalgae. 
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unicellular stramenopiles had more diverse PFAMs than their 
macroalgal correlates (Figure 3B). The uniformity of coding 
potential in Ochrophyta macroalgae is likely due to their recent 
evolution from a common stramenopile ancestor (Cock et al., 
2010a, 2010b). 


Rhodophyta showed a drastic increase in the diversity and over- 
all quantity of PFAM DUFs, while Chlorophyta and Ochrophyta 
had fewer DUFs compared to their unicellular correlates 
(Figure 2C, Supplemental Table 3). The predominant features 
of the annotated PFAMs, expanded in macroalgae, were mem- 
brane proteins (e.g., transporters and adhesome components). 
We used a set of 183 transporter PFAMs to detect any signifi- 
cant (adj p «0.05) gains or losses. Ochrophyte, but not chloro- 
phyte or rhodophyte, macroalgae had significant expansions 
of a clan CLO184 transporter (PF07857, p = 1.32 x 10-5, 
Supplemental Table 3). Of these, 29 had significantly different 
retention in macroalgae compared to microalgae. An 
endoplasmic reticulum vesicle transporter (PF07970) was 
significantly expanded in Rhodophyta but not Chlorophyta 
or Ochrophyta macroalgae (p = 2.20 x 107794). A potassium 
(K*) transporter (PF02705), widely conserved in microalgae 
from all three phyla, was missing from chlorophyte and 
rhodophyte but retained in ochrophyte (eq. p = 1.95 x 107779) 
macroalgae. This K* transporter may have been a new 
introduction to microalgae after the major rhodophyte and 
chlorophyte seaweeds emerged, or it could have been lost 
from the older two macroalgal lineages. 


Several other transporters were expanded in one or more macroal- 
gal lineages. A branched-chain amino acid transporter was signif- 
icantly expanded in all three macroalgal phyla compared to their 
unicellular correlates (PF02653; Chlorophyta p = 8.99E—227, 
Ochrophyta p = 1.62E—30, Rhodophyta p = 2.48E-88). The 
need to transport amino acids is inherent in multicellular systems 
where nutrient absorption and energy expenditure are spatially 
segregated. A chromate transporter (PF02417) was expanded in 
Rhodophyta (p = 1.25E— 19), but not in Chlorophyta or Ochrophyta 
macroalgae compared to their unicellular correlates. A periplasmic 
zinc uptake complex component (ZnuA, PF01297, chelatase) 


(A) Intersection plot showing shared and unique PFAMs in chlorophyte, rhodophyte, and ochrophyte micro- and macroalgae. Only nine PFAMs were 
exclusively conserved in chlorophyte, rhodophyte, and ochrophyte microalgae (listed); 222 PFAMs were exclusively conserved in the three macroalgal 
lineages. Rhodophyta macroalgae had the most unusual PFAMs (883) compared to the other groups. 

(B) Protein family counts per macroalgal genome were compared for each PFAM domain (n = 12 938 PFAMs in 38 814 comparisons, Supplemental 
Table 3) between macroalgae and their microalgal counterparts in each phylum using FDR-corrected batch t-tests in phyla-wide response screens. 
Unicellular ochrophytes had a greater diversity of PFAMs than their multicellular counterparts, correlating with the short period in which the latter had to 
evolve. 

(C) Unique PFAM sets were used in domain-centric enrichment analyses. No GO terms were significantly enriched in most PFAM sets defined by phyla 
and multicellular phenotypes. The 222 macroalgal-specific PFAMs were used in dcGO (Fang and Gough, 2013) enrichment analyses to determine 
functional enrichment (GO terms; see also Supplemental Table 2). Enrichment of cell morphology, specialization, and multicellular organism 
organization GO terms was found in this set. The PFAMs constituting each GO term essentially comprise core gene collections unique to multicellular 
aquatic algae compared to their unicellular relatives but conserved in all three major phyla. Enrichment Z scores were plotted to Cytoscape network nodes 
in a model reconstructed from the relationships in the go.obo GO infrastructure file (https://doi.org/10.5281/zenodo.2529950). The Cytoscape network 
used to create this panel is in Supplemental Data 3 (.cys file). 

(D) KEGG map of chlorophyte macroalgal-specific metabolic pathways. The PFAMs varying with robust significance between macroalgae of the three 
clades compared to their microalgal counterparts were used to find Enzyme Commission (EC) (McDonald and Tipton, 2023) codes and mapped to 
pathways in the Kyoto Encyclopedia for Genes and Genomes (KEGG) using iPATHS3 (Letunic et al., 2008). The EC lists in Supplemental Table 3 can 
be processed in iPath3 (https://pathways.embl.de) (Letunic et al., 2008) for an interactive exploration of macroalgal-specific reactions. 

(E) KEGG map of rhodophyte macroalgal-specific metabolic pathways. 

(F) KEGG map of ochrophyte macroalgal-specific metabolic pathways. 
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was similarly expanded in Rhodophyta but not in Chlorophyta or 
Ochrophyta macroalgae compared to their unicellular correlates 
(adj. p = 9.23 x 10-"'). This observed expansion of a zinc chelation 
gene in Rhodophyta may be responsible for the increased zinc 
concentration found in red, but not green or brown, seaweeds 
(Stengel et al., 2004). Macroalgae can accumulate heavy metals 
from their environment. This accumulation is a result of various 
mechanisms, including adsorption on the cell surface and 
intracellular uptake, and has supported seaweed-based bioreme- 
diation efforts. Red macroalgae have already achieved moderate 
success in removing unwanted inorganic molecular (e.g., heavy 
metals) from sensitive ecosystems and industrial processes (He 
et al., 2008; Sode et al., 2013; Kidgell et al., 2014; Roberts et al., 
2015; Deniz and Ersanli, 2018; Elizondo-Gonzalez et al., 2018; 
Sun et al., 2019; Luo et al., 2020; Astrahan et al., 2021; Mansour 
et al., 2022; Znad et al., 2022). Our results show two expanded 
gene families in Rhodophyta macroalgae that could boost their 
bioremediation potential. 


Metabolic pathways gained in macroalgae during their 
evolution of multicellularity 


Macroalgae have evolved novel developmental programs coor- 
dinating morphogenesis. Cell polarity is established, and cell 
differentiation produces cell types underlying novel multicellular 
traits. For example, the millimeter-long holdfast rhizoid cells in 
Rhodophyta species (e.g., Porphyra [Brawley et al., 2017] and 
Griffithsia [Goff and Coleman, 1987]) facilitate survival in violent 
intertidal areas. Other cell-type specializations, such as the 
mat-forming stolons of Caulerpa cylindracea, promote invasive- 
ness by smothering and over-crowding other native macroalgae 
(Katz et al., 2021). Here, we show unique metabolic pathways 
distinguishing macroalgae from their unicellular relatives 
(Figure 3D-3F, Supplemental Table 3). A striking feature in 
metabolic maps of macroalgal-specific genes is the prevalence 
of polysaccharide metabolism pathways. These pathways 
contribute to complex, tissue-specific glycoprotein patterns 
that define organismal morphology and cell-development pro- 
cesses. Macroalgae rely on diverse sugars and sugar derivatives 
for specific ECM components (Domozych and Domozych, 2014; 
Kloareg et al., 2021; Mazeas et al., 2022). Our data revealed 
lineage- and habitat-specific sugars and broadly conserved 
sugar metabolism genes. Rhodophyte macroalgae exhibited 
significantly more glycosyl hydrolases and carboxylesterases 
than their microalgal correlates (adj. p «0.05), while carbonic an- 
hydrases were more abundant in macroalgal chlorophytes 
(PF00194, adj. p «0.05), likely for concentrating CO» in Ulvophy- 
ceae. See Supplemental Note 1B for expanded technical and 
contextual descriptions of macroalgal ECM analyses. 


ECM polysaccharides play crucial roles in cell adhesion me- 
chanics and provide protection for macroalgae from their sur- 
rounding environment. Components of seaweed ECMs, such 
as mannose derivatives, sulfated polysaccharides (SPs), and 
acidic-residue-rich linking proteins such as integrins, are 
indispensable for macroalgae, and sets of these genes distin- 
guish them from microalgae and land plants. SPs, which are 
particularly structurally diverse and unique to some marine 
plants and macroalgae, comprise numerous building blocks 
for tissue-specific ECMs. These adhesive molecules have 
been implicated for use in human bone tissue regeneration 
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(Venkatesan et al., 2019). Genes involved in mannose, another 
important ECM component, metabolism were conserved 
across all three macroalgal phyla examined, suggesting their 
significance for aquatic multicellular algal survival. Macroalgae 
lack glycosyltransferase activity required in complex-type 
N-glycan biosynthesis (Yoshiie et al., 2012). Mannose, and 
other osmolyte-type sugars, can also promote marine survival 
through conferring resistance to salinity. Although widely 
conserved in seaweeds, sparse examples of the mannose me- 
tabolism’s role in salt stress resistance in land plants have 
been reported (Hirano et al., 2000; Ma et al., 2014; Yu et al., 
2021). Thus, the mannose metabolic pathways reported here 
may be of value for crop-improvement strategies. 


Macroalgal-specific PFAMs shared in all three lineages reveal 
convergence toward these functions (Figure 3). We used 
this set of 222 PFAMs as queries in a dcGO enrichment 
analysis and found enrichment in GO terms related to cell 
membrane reorganization, cell polarity establishment, cytoskel- 
eton rearrangement, and cell specialization in vascular 
tissue (Supplemental Table 2). Genes related to mannose 
and sulfated polysaccharide metabolism also distinguished 
them from microalgae. We found a significantly higher number 
of sulfatase domains in rhodophyte macroalgae than in 
microalgae (PF00884, clan = CLO088, adj. p = 1.43 x 107°, 
Supplemental Table 3). Land plants generally lack SPs and 
high-mannose cell walls. However, angiosperm seagrasses 
such as Hydrocharitaceae, Ruppiaceae, Posidoniaceae, 
Cymodoceaceae, and Zosteraceae contain both SPs and high- 
mannose-type N-linked glycans. The convergent evolution of 
these ECM components among photosynthetic, marine, and 
multicellular eukaryotes from diverse lineages implies their 
adaptiveness in a marine environment. Conversely, Rhodophyta 
display remarkable complexity in ECM-related pathways, 
setting them apart from the other two lineages. These ECM 
gene collections from Rhodophyta macroalgae were some of 
the most prominent examples of evolutionary divergence seen 
in our datasets. Presumably, the much longer evolutionary 
period of the Rhodophyta macroalgal clades allowed for the 
development of more complex ECM biosynthesis gene reper- 
toires. We speculate that the greater number of glaciation cycles 
experienced by the Rhodophyta drove the higher-complexity 
ECM systems inferred from our data, as an adaptation to 
freeze-thaw cycles requires flexible cell-cell adhesive networks. 


Macroalgal multicellularity facilitated by cell polarity, 
communication, and adhesion 


Macroalgae exhibit complex life cycles that vary among species. 
Pioneering research into the life cycles of model phaeophyte 
macroalgae, such as Ectocarpus sp. (Muller, 1966, 1967b, 
1967a, 1968) and members of the Fucales (Vreeland et al., 
1993; Kropf, 1997) and Dictyotales orders (Charrier et al., 
2017), has uncovered intricate developmental strategies. 
Typically, macroalgae progress through four multicellular 
stages, including both fertile and vegetative gametophytes and 
sporophytes (Yang et al., 2016; Macaisne et al., 2017; Arun 
et al., 2019). In some taxa, sexual reproduction polarity is 
induced at the zygotic phase (Kropf, 1997), and polarized 
embryos develop following zygote fertilization (Charrier et al., 
2017). One key feature of macroalgal cells distinguishing them 
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from microalgae is their extensive cell-cell communication net- 
works. Our dataset suggests that these networks partially arose 
due to gene gains compared to their unicellular ancestors. The 
thrice-conserved, macroalgal-specific 222 PFAM set showed 
dcGO enrichment for “cell-cell recognition" (GO:0009988, 
Z = 9.11, adj. p = 2.54 x 10-2; Supplemental Table 2), which 
included an FAD-dependent oxidoreductase (IPR006076). 
Various redox processes are involved in cell-cell communication 
and are thought to have been essential in animal multicellularity 
development (Blackstone and Ellison, 2000). Thiol-based sen- 
sors constitute central machinery in plant defense signaling net- 
works (Chae et al., 2023). We observed significant expansions in 
copper zinc superoxide dismutase (PF00080, Sod Cu) in Rho- 
dophyta (adj. p = 4.81 x 1075) but not in Chlorophyta or Ochro- 
phyta macroalgae (see also Supplemental Table 3). Ochrophyta, 
but not Rhodophyta or Chlorophyta, showed significant 
expansion of a thioredoxin gene (PF00085, p = 1.9 x 1075). 
These distinct expansions in the Rhodophyta and Ochrophyta 
macroalgae highlight non-plant mechanisms for redox sensing 
in multicellular phototrophs. 


Macroalgae exhibit complex cell-to-cell adhesion, facilitated by 
elaborate ECMs or direct connections, a hallmark of multicel- 
lular development. Proteins and polysaccharides, collectively 
known as the adhesome, mediate these processes (Whittaker 
et al., 2006; Zaidel-Bar et al., 2007,; Zaidel-Bar and Geiger, 
2010; Zaidel-Bar, 2013; Winograd-Katz et al., 2014). 
Coordinated cell-to-cell adhesion, facilitated by complex 
ECMS or direct connections, is another hallmark of multicellular 
development (Domozych and Domozych, 2014; Brunet and 
King, 2017; Kloareg et al., 2021). Variations in cell-cell adhesion 
mechanisms have been proposed to underscore the progres- 
sion from simple to complex multicellularity (Strother et al., 
2021) and our data support this hypothesis. Thus, we 
interrogated our genome datasets for genes involved in 
adhesion that could facilitate multicellularity, especially those 
implicated in developmental and cellular differentiation 
processes. The multistage adhesive process starts during the 
early development of zygotes and influences the survival rates 
and reproductive success of macroalgae (Vreeland et al., 
1993; Matt and Umen, 2016), We defined the macroalgal 
adhesome as including cadherin, integrin, armadillo (ARM), 
lectin, F actin-binding calponin homology, Kringle, Pleckstrin 
homology (PH), Src homology 2 (SH2), and LIM PFAMs (n = 
110 PFAMs; see Supplemental Table 4). Other enzymes, such 
as guanine nucleotide exchange factors and GTPase- 
activating proteins, protein tyrosine, serine/threonine kinases 
and phosphatases, and E3 ligases and proteases, regulate 
adhesion by post-translational modification and do not directly 
adhere to other proteins. Differences in levels of these proteins 
in macroalgae compared to microalgae were seen in our 
response screen (Supplemental Table 3); here, we focused on 
PFAMs with the potential to bind to proteins presented by 
other cells or in the ECM (i.e., cell-cell and cell-matrix) and 
their accessory domains, or codomains, which regulate 
and fine-tune adhesive interactions (Zaidel-Bar and Geiger, 
2010; Zaidel-Bar, 2013; Collins et al., 2017; Hernandez- 
Vasquez et al., 2017; Thompson et al. 2021). From 110 
PFAMs constituting the queried adhesome (Supplemental 
Table 4), 146 539 PFAMs in 124 queried macroalgal genomes, 
or ~1154 adhesome PFAMs/genome, were recovered 
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(Supplemental Table 4). Several macroalgal cell-cell linkage 
and communication genes were either absent or present in 
higher copy numbers than microalgae (Supplemental Table 4). 
Our adhesome analyses point to a significant role for these 
genes/domains in the evolution of multicellularity in macroalgae. 


We analyzed genes containing these domains for codomains. 
Altogether, 7065 adhesome codomains or ~55.6 adhesome co- 
domains were detected per genome. These codomains likely 
extend the functionality of their proteins, providing combinatorial 
complexity to overcome a wide variety of biotic and abiotic chal- 
lenges faced by their host lineages. We compared the adhesome 
across macroalgal phyla, native climates, and habitats (Figure 4). 
We saw 294 adhesome proteins that significantly differed in 
copy number between phyla and 770 that significantly varied 
by climate (adj. p «0.05). The 5365 significant differences in 
adhesome PFAMs by phyla within habitats (Supplemental 
Table 4) suggest that environmental cues had shaped adhesome 
development in a convergent manner in the three macroalgal 
phyla. Still, we found that 129 PFAMs significantly differed among 
phyla within their respective habitats. These lineage-specific ad- 
hesome components detail the genetic blueprints for multicellular 
development in the rhodophyte, chlorophyte, and ochrophyte sea- 
weeds. Rhodophyta species had more extensive collections of 
adhesome elements, and subpolar species from any clade had 
substantially less. A minimal adhesome gene set in subpolar algae 
suggests that fewer genes for these proteins may be necessary in 
subpolar regions. 


The compendium of cadherin codomains seen here suggests 
macroalgae exploit the capacity of this network to use mecha- 
nisms outlined in the differential adhesion hypothesis 
(Wiseman, 1977; Foty and Steinberg, 2005; Amack and 
Manning, 2012) for complex tissue and body plan 
organization as well as climate-specific adaptation (see also 
Supplemental Table 4). For example, epiphytic phaeophyte 
macroalgae had significantly more Laminin G 3 domains than 
epiphytic rhodophytes (adj. p = 0.03). Phaeophyte algae had 58 
adhesome domains with significantly different levels than the 
other phyla, with 69 comparisons showing significantly fewer 
adhesome domains than other clades. Ochrophyta macroalgae 
had adhesome gene expansions, including only five PFAMs. 
These consisted of the intraflagellar transport 81 calponin homol- 
ogy domain (IFT81 CH, PF18383, increased in phaeophyte 
compared to [1] chlorophyte epiphytes, adj. p = 1.07789 x 
10715: and [2] rhodophyte epiphytes, adj. p = 3.26949 x 10°75), 
calponin homology (CH) domain (PF00307, increased in 
mangrove-dwelling phaeophytes compared to chlorophytes, 
adj. p = 0.0008), and the concanavalin A-like lectin/glucanases 
(Laminin G 3, higher in Rhodophyta [epiphytic], adj. p = 0.031). 


Comparisons of adhesome domains in the three macroalgal 
clades compared to their microalgal counterparts showed 191 
significant differences (adj. p «0.05). These results outline various 
mechanisms for tissue formation and organization of multicel- 
lular, marine-dwelling photosynthetic eukaryotes. Chlorophytes 
(n 2 11) had 15 adhesome domains with significantly different 
levels than their microalgal correlates. Of these, 14 were present 
at increased levels. An acetamidase formamidase (FmdA  AmdA, 
PF03069) adhesome codomain was present at significantly fewer 
levels in macroalgae, and this may represent a reduced need 
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Figure 4. The macroalgal adhesome as a function of phyla, habitat, and multicellularity. 
(A) Macroalgal adhesome PFAMs were compared across phyla and habitats (n = 124 genomes) in FDR-corrected batch t-tests (Supplemental Table 4). 
The PFAMs selected from Supplemental Table 1 comprising the adhesome were cadherin, integrin, armadillo (ARM), lectin, F actin-binding CH, Kringle, 
PH, Src homology 2 (SH2), and LIM PFAM domains, altogether consisting of 110 distinct HMM-based PFAM domains. All data points shown are 
significantly different at adj. p «0.05. Groups of macroalgae listed in the top boxes were compared to those listed along the x axis, where a positive 


number signifies an increase in the top level. 


(B) Prediction profiles for whether a subject is microalgal or macroalgal as a function of adhesome PFAMs. These variables contribute the most toward 
classification in a deep neural network fitting adhesome PFAMs from the macro- and microalgal genomes. The charts are marginal plots describing the 
effect of increasing or decreasing PFAM domain counts in the macroalgal groups as predictive of either macro- or micro-cellularity status. 

(C) Attribute influence (i.e. variable importance) shows the predominant decision-making adhesome PFAMs defining macroalgal multicellularity. These 
adhesome constituents served as adequate markers for the transition to multicellularity in macroalgae. 
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to assimilate nitrogen from these sources in multicellular chloro- 
phytes. Ochrophyta macroalgae (n = 45) had the most signifi- 
cantly differing adhesome codomains (116 comparisons with 
adj. p «0.05) compared to their unicellular ochrophyte correlates. 
Differing domains included two immunoglobulins (lg. 3, 
PF13927 and Ig. 2, PF13895) as well as various TFs and kinase 
adhesome codomains. For example, Ochrophyta macroalgae 
had significantly more diaphanous FH3 domains (Drf FH3, 
TPR-like, clan = CL0020, PF06367). This domain is a core 
component of actin regulatory genes and has a wide-reaching in- 
fluence over cell morphology and adhesive properties. 


Macroalgal adhesome domain composition (including cadherin, 
integrin, ARM, lectin, F actin-binding CH, Kringle, PH, SH2, and 
LIM PFAMs; n = 110 PFAMs) was sufficient to build an artificial 
neural network (aNN) that could accurately determine the unicel- 
lular or multicellular status of an alga (R? = 92.9; Figure 4, 
Supplemental Figure 4). This aNN was composed of 24 layers 
of 13 nodes, each consisting of eight tanH, four linear, and one 
Gaussian activation function (8-4-1 TanH-L-Gx24). The predic- 
tive capacity of these PFAMs in our aNN point to macroalgal ad- 
hesomes being critical factors for their multicellularity status. 


Cadherins had very heterogeneous distributions in the macroal- 
gal genomes, with only 85 out of 124 genomes surveyed contain- 
ing at least one cadherin domain and some species showing sub- 
stantial expansions in cadherins. The sparsity shows that 
cadherins may not be necessary for a multicellular phenotype in 
macroalgae. This finding contrasts with studies showing that 
cadherins alone are capable of driving complexity in emerging, 
evolving multicellular lineages (Foty and Steinberg, 2005) and 
indicates that a wide variety of cell-cell adhesion mechanisms 
can support a multicellular lifestyle. Adhesome networks of cad- 
herins, collectively known as the cadhesome (Zaidel-Bar, 2013), 
are generally characterized by many more linkers than nodes. 
Non-cadherin PFAMs accompanying cadherin domains in the 
macroalgal genes coded for diverse functions. Many cadherin 
codomains had other similar binding and receptor functions. Co- 
hesins, Kringle domains, Dockerins, Sushi domains, the bilobal 
receptor L domain, and various lectins were cadherin codomains. 
These codomains play dominant roles in cell-cell adhesion dy- 
namics and intracellular communication, including mechano- 
transduction (Campbell and Humphries, 2011; Tong et al., 
2018; Zheng and Leftheris, 2020) and small-molecule sensing 
(Forzisi and Sesti, 2022). Some macroalgal cadherins had TIG/ 
IPT domains characterized by immunoglobulin-like folds. These 
TIG domains are commonly found in cell surface receptors 
such as Met and Ron and DNA-binding TFs (Aravind and 
Koonin, 1999; Bork et al., 1999). 


Differential binding from cadherins, integrins, and other adhesive 
proteins in differentiated cell types is the basis for the differential 
adhesion hypothesis (Steinberg, 1975; Wiseman, 1977; Nicol and 
Garrod, 1979; Thomas and Yancey, 1988; Foty and Steinberg, 
2005; Emily and Francois, 2007; Amack and Manning, 2012). 
This hypothesis outlines mechanisms for differentiated cell 
sorting, tissue morphogenesis, rearrangements, spreading, and 
segregation. It posits that cadherins' type, activation state, 
and density directly mediate cellular organization and organism 
body plan by minimizing adhesive-free energy through opti- 
mizing cell-to-cell binding (Foty and Steinberg, 2005). Thus, 


758 Molecular Plant 17, 747-771, May 6 2024 © 2024 The Author. 


Macroalgal deep genomics 


different types of cadherins in the macroalgal genomes likely 
play fundamental roles in their respective lineages' tissue 
morphogenesis and overall morphologies. We analyzed the 
macroalgal cadherins and other adhesome components for 
other concurrent PFAM domains to identify lineage- and 
species-specific cell connection-related PFAMs that may be 
responsible for their multicellular morphogenic body plans. See 
Supplemental Note 1C for expanded descriptions of cadherin 
and other adhesome codomains in the macroalgal genomes. 


Changes in integrin and cadherin rigidity sensing regulate 
cell adhesion properties (Collins et al., 2017). This rigidity 
modulation regulates actin organization and membrane 
dynamics during cell-cell adhesion. Different domains in these 
proteins could alter how mechanotransduction signals are 
interpreted (e.g., through TF signaling) and influence cellular po- 
larity determination. For example, vast hierarchies of TFs regu- 
late macrophage polarization in metazoans (Czimmerer and 
Nagy, 2023). In instances where cadherin codomains bind 
specific macromolecules, such as Kringles, which bind to cell 
membranes, the cadherins and integrins can more precisely 
determine the organisms' physiology by requiring a body plan 
that conforms to its macromolecular binding requirements. 
Dynamic cellular adhesion by the “mix-and-match” design 
might facilitate resilient survival in seaweeds clinging to 
intertidal rocks despite violent waves and daily dehydration in 
baking sunlight. The evolution of integrins and cadherins with 
exotic codomains could provide unexpected functional 
benefits through far-reaching consequences in an organism's 
body plan and resultant morphology. Our data reveal the combi- 
natorial adhesome complexity facilitating the evolution of novel 
multicellular phenotypes in three macroalgal lineages. 


Macroalgal endogenous viral elements and concurrent 
functional domains 


The interaction of early eukaryotes with viruses has been hypoth- 
esized to favor the evolution of multicellularity as a defense 
against infection (Iranzo et al., 2014). Here, we examined the 
role of viruses in the evolution of macroalgal multicellularity, not 
through selection for immune systems but as a means of 
exchanging genetic information among lineages and conferring 
gain-of-function mutations. We examined viral elements in the 
coding sequences (CDSs) from the macroalgal genomes using 
HMMsearch (Potter et al., 2018) as in Skewes-Cox et al. (2014) 
and Nelson et al. (2021). The absolute quantities of their 
inclusion were not necessarily clade dependent, as clades had 
highly heterogenous inclusion of viral families (VFAMs), and no 
phyla surveyed had significantly different quantities of VFAMs 
from the other two (Figure 5, Supplemental Table 5, VFAMS). 
We observed clade-dependent (i.e., high conservation, retained 
by most members) and clade-independent (e.g., sporadic or 
correlating with native climate) virus sequence integration in 
the macroalgal genomes (see Supplemental Note 1D for 
expanded technical and contextual description of the global 
VFAM analyses). 


EsV-1-7-like domains are sporadically found throughout eu- 
karyotes; previously, only two chlorophyte species and one 
cryptophyte species were known to share EsV-1-7-like do- 
mains (Macaisne et al., 2017). EsV-1-7 proteins represented 
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Figure 5. VFAM sequences in macroalgal genomes and their variance with clade and climate. 
(A and B) Comparison of total VFAMs in macroalgal genomes from Rhodophyta (n = 74), Ochrophyta (n = 43), and Chlorophyta (n = 11). Rhodophyta had 
the highest average VFAMs per genome, followed by Chlorophyta, then Phaeophyceae. No phyla significantly differed from another in terms of average 
VFAMS per genome. 
(C) Comparison of average VFAMs per genome by native climates. Data points represent genome differences in standardized residues of VFAM means 
compared between the climates listed. Large black dots represent significant differences (adj. p «0.05). 
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the highest copy number PFAM for the Ochrophyta macroalgal 
genomes differentiating them from the other phyla (Figure 2A, 
largest bubble in Ochrophyta >80% purity). Phaeophytes had 
293.5 + 229.4 EsV-1-7 domains per genome (n = 43). The 
high Esv-1-7 counts in Phaeophyceae compared to the very 
low or nonexistent EsV-1-7 domains in other macroalgae 
establish Esv-1-7 motif retention as a defining feature for the 
clade. Four Rhodophyta species were also found with about 
five EsV-1-7 motifs per genome: Porphyra plocamiestris 
(CCMP673), Catenella sp. (CCMP2858), Porphyra miniata 
(CCAP1379/2), Caloglossa vieillardii (CCMP3076), and Polysi- 
phonia violacea (CCAP1348/5). The connection of Esv-1-7 
domains and various proteins necessary for a multicellular 
phenotype suggests that the acquisition of lysogenic virus- 
Sourced genes comprises essential genomic architecture in 
the Phaeophyceae. 


We surveyed genes with EsV-1-7 domains for accessory do- 
mains, or codomains (Figure 6). Altogether, 931 codomains 
were found in 13 696 EsV-1-7 motif-containing genes (n = 43 
phaeophyte algal genomes surveyed). Codomains included the 
Rieske domain (PF00355.2, E - 7.9 x 1078), which is involved 
in electron transfer in many systems (Dong et al., 2019; Feyza 
Ozgen et al., 2020; Mei et al., 2020; Nji Wandi et al., 2020; 
Purhonen et al., 2020; Hou et al., 2021; Di Trani et al., 2022). 
The acquisition of Rieske domains and their long-term mainte- 
nance indicates a fundamental shift in the biological capacity of 
the host lineage. Retained EsV-1-7 intact codomains may offer 
new metabolic possibilities for their hosts. For example, Rieske 
proteins have conferred photosynthesis to non-photosynthetic 
cells (Feyza Ozgen et al., 2020). Such a drastic phenotype 
change was observed experimentally, but the long-term geolog- 
ical implications of massive arrays of viral-sourced genes in mac- 
roalgal lineages have not yet been appreciated. We found 
Rieske-Esv-1-7 codomains in 23 of the 43 Phaeophyceae 
genomes, suggesting that the function of this viral-sourced 
gene is partially conserved. See Supplemental Note 1E for an 
expanded description of EsV-1-7 codomains and gene 
structures. 


The other prominent viral-origin domain found in Phaeophyceae 
genomes was the FNIP motif. We saw 4478 FNIP domains in 
phaeophyte algal genomes, with 195 accessory codomains 
sporadically distributed throughout genes. Plectins comprised 
44 FNIP codomains, suggesting a conserved role for FNIP- 
Plectins. Plectins link actin microfilaments, microtubules, and 
intermediate filaments and maintain the mechanical integrity 
and viscoelasticity in tissues (Broussard and Green, 2022; 
Marks et al, 2022; Prechova et al., 2022; Wiche, 2022). 
Protein kinases were also FNIP codomains in phaeophyte 
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genomes. A protein tyrosine and serine/threonine kinase 
(PF07714) and the protein kinase PF00069 were FNIP 
codomains. As a representative example, gene 4491 of scaffold 
32 in the Ectocarpus siliculosus genome (Cock et al., 2010b) 
had FNIP motifs adjacent to a protein tyrosine and serine/ 
threonine kinase (PF07714) domain. These types of genes were 
conserved endogenous viral elements in Phaeophyceae 
genomes. Protein tyrosine and serine/threonine kinases were 
some of the most significantly varying endogenous viral-origin 
proteins (EVOPs) among native macroalgal climates; they may 
provide climate-specific fitness. 


The ratio of nonfunctional to functional elements accompanying 
EsV-1-7 domains was determined in selected genomes (see 
Supplemental Table 5). The phaeophyte alga Botrytella 
uvaeformis (CCAP1311/1) had 491 Esv-1-7 domains. Only six 
EsV-1-7 codomains were found: two Rieske (PF00355.29) do- 
mains, a zinc-finger (zf-MIZ, PF02891.23), an E3_ubiquitin ligase 
(RBR, PF18091.4), and two DUF5025 (PF16428.8) domains. This 
ratio implies a high number of viral insertions, but few of these se- 
quences had known functions. Similar conclusions could be 
made from the manual inspection of ~800 Interproscan search 
results for EsV-1-7 genes with codomains. Only ~5% of these 
genes had defined functions. Most were characterized by tracks 
of EsV-1-7 motifs (6-20 copies of each protein) and a short disor- 
dered or predicted transmembrane region at either terminus. 
EsV-1-7 motifs were found on predicted intracellular and 
extracellular domains, suggesting that these proteins have 
signaling roles or may even function as structural proteins in 
Phaeophyceae. 


Strikingly, we also observed EsV-1-7 cysteine-rich repeats as 
FNIP codomains. The FNIP motif is not coded for by EsV-1 and 
is instead found in members of the Megaviradae. Their coexis- 
tence in an open reading frame (ORF) alludes to a chimeric origin. 
The Megavirus baoshan has 82 protein-coding genes containing 
FNIP repeats (Xia et al., 2022). Most of these FNIP-containing 
proteins had terminal F-box domains, and up to 28 FNIP copies 
are in each gene. Interactome analysis of the FNIP protein 
Mb0983 showed how M. baoshan exploits its host's ubiquitin- 
proteasome pathways during infection. Specifically, FNIP- 
Mb0983 interacts with the GTPases Rap1B and Rab7A in the 
cytoplasm of Acanthamoeba castellanii (Xia et al., 2022). Thus, 
a viral-sourced, permanently acquired reprogramming of cell 
fate, regulated in a dose-dependent or on/off manner, could 
lead to differentiated cell types. Our data also suggest that meg- 
aviruses have provided phaeophyte algae with plectin genes. 


Metabolic complexity appears to have increased in all the major 
macroalgal lineages, partly because of virus gene donation. 


(D) Viral families (VEAMSs, x axis) in species (y axis) from each of the three phyla. VFAMs were hierarchically bi-clustered using average Euclidean distance; 
the Z score correlates with relative VFAM counts. A cluster of VFAMs mostly unique to phaeophyte algae are shown in the dashed line box. These VFAMs 
are also found in viruses infecting Rhodophyta and Chlorophyta algae, which hints at cross-phyla infection; however, these specific VFAMs were not in 


the Rhodophyta or Chlorophyta algal genomes. 


(E) Rhodophyta showed two major clusters of VFAMs, mostly unobserved in the other phyla. 

(F) Phaeophyceae showed two major clusters of VFAMs, mostly unobserved in the other phyla. 

(G) Effect size and FDR logworth of habitat variance within phyla on VFAM contents, where logworth = —log10(p value) and the effect size is the 
normalized mean of the sum of squares. The strongest effect sizes were seen in Rhodophyta VFAMs, which can be explained by the greater time over 
which they evolved niche-specific adaptations. VFAMs highlighted here include VFAM 1705 (Prasinovirus), VFAM 964 (Ictalurivirus), and VFAM 681 
(Orbivirus). Extended descriptions of VFAMs from these analyses are in Supplemental Table 5, Supplemental Data 3, and are also at derisilab.ucsf.edu. 
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Figure 6. Phaeophyceae EsV-1-7 cysteine-rich motifs were found in macroalgal genes coding for a wide variety of functions (n = 43 


genomes surveyed). 


The FNIP repeat domain was also observed as a predominant feature in many Phaeophyceae genomes. Both motifs are common in viruses and amebae, 
and our data reveal that these sequences are conserved in Phaeophyceae as well. These motifs were found in a wide variety of genes with functions 
relating to molecular and physiological niche adaptation strategies, especially as they pertain to phaeophyte macroalgae. 

(A) EsV-1-7 domain counts in phaeophyte macroalgae from different climate zones. A single line signifies that only one species was sampled in that zone. 
Our data show that ~300 EsV-1-7 domains could be recovered from each Phaeophyceae genome. 

(B) Genes with VFAMs that also contained PFAMs (EVOP families) were compared between phyla (Supplemental Table 7). Phaeophyceae EVOPs were 
present at significantly higher levels than the other two clades, with FNIP and EsV-1-7 domains seen in drastically higher quantities than other EVOPs. 
(C) Heatmap showing EsV-1-7 codomains. Remarkably, the multifunctional Rieske PFAM domains were common EsV-1-7 codomains. 


Algal viruses can serve as metabolic gene reservoirs. For 
example, the Chlorovirus PBCV-1 has more than 400 protein- 
encoding genes, many of which are completely unexpected 
for a virus (Van Etten and Dunigan, 2012). Our past work with 
microalgal genomics revealed a vast network of viral gene 
transfers; here, we show that such donations may have 
generated scaffolds for complex multicellular developmental 


processes. The main contributions from viruses in each of the 
clades seem to be in redox balancing, transcriptional 
regulation, and metabolic flexibility. Further investigation of the 
role of viruses in the origin of subsequent adaptive radiations 
in especially the brown algae could provide valuable insights 
into a clade of growing environmental and industrial 
significance. 
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In summary, our work has successfully established a vast and 
varied genomic resource encompassing macroalgae from a 
wide range of phyla, climates, and ecological niches. Through 
meticulous comparative analyses, we have identified specific 
gene sets that elucidate both convergent and divergent 
evolutionary trajectories, contributing to the emergence of multi- 
cellularity and the rich morphological diversity observed in these 
organisms. We have highlighted critical genes and domains inte- 
gral to processes such as cell adhesion, communication, polarity, 
and differentiation, which are fundamental in the development 
and sustenance of the complex multicellular structures in macro- 
algae. A significant emphasis of our study was on the role of 
endogenous viral elements and EVOPs, which play a pivotal 
role in the genomic diversification and innovative capacities of 
macroalgae. The datasets presented in this study not only 
deepen our understanding of the genetic underpinnings and envi- 
ronmental adaptability of macroalgae but also lay a robust foun- 
dation for future explorations and applications in the realm of 
genome-based macroalgal biology. 


METHODS 


Macroalgal cultures and sampling 


The culture collection centers CCAP (www.ccap.ac.uk) and the 
National Center for Marine Algae and Microbiota (NCMA/CCMP; 
ncma.bigelow.org) provided the global macroalgal cultures; 
CCAP, 73 cultures and NCMA/CCMP, 39 cultures. This study 
has sequenced the genomes of all the available macroalgal cul- 
tures at the culture collection centers CCAP and CCMP between 
2019 and 2021. The strains we used in this study covered a 
wide range of geographic locations, environments, and climatic 
zones (Supplemental Table 1). These included Rhodophyta 
(n = 65, excluding parasitic and unicellular species), Chlorophyta 
(n = 11, excluding higher plants, unicellular species, and 
seagrasses), and Ochrophyta macroalgae (n = 36, from the 
Ochrophyta phylum). These macroalgae were isolated from 
marine (n = 97), freshwater (n = 10), and brackish (n = 5) 
environments. Their native climates varied from polar, subpolar, 
temperate, subtropical, and tropical (Supplemental Table 1). 
Additionally, nine marine macroalgae (Chondria dasyphylla, 
rhodophyte; Avrainvillea amadelpha, chlorophyte; and Padina 
boergesenii, Polycladia myrica, Sargassum latifolium, Sargassum 
angustifolium, Canistrocarpus cervicornis, and two unidentified, 
phaeophytes) from the UAE (https://doi.org/10.5281/zenodo. 
7158508) were included in our analyses. All the strains were 
cultivated under 10 pmol photons/m?/s at 13 h light:11 h dark 
cycle at NCMA; and under 20 umol photons/m?/s at 12 h 
light:12 h dark cycle at CCAP. The NCMA cultures were 
maintained in sterile Magenta GA-7-3 vessels with 0.5 L of 
medium (Supplemental Table 1). Cultures were axenically 
transferred every 12 weeks by dissecting a piece of macroalgal 
tissue and placing it in a new culture container of freshly filtered 
seawater with appropriate medium components aseptically 
added at the time of transfer. 


DNA extraction, sequencing, and de novo genome 
assembly 


Most DNA extractions were done using a phenol-chloroform 
protocol with the CTAB (cetyltrimethylammonium bromide; 
Sigma-Aldrich, St. Louis, MO, USA) extraction buffer (Phillips 
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et al., 2001). The DNA from nine macroalgal culture samples 
(Supplemental Table 1) was extracted with the Dneasy Plant 
Pro Kit (Qiagen, Hilden, DE). We found that the traditional CTAB 
DNA extraction protocol yielded higher DNA concentrations 
and qualities. Sequencing of the 110 macroalgal genomes 
was performed with Illumina NexSeq 550, HiSeq 2500, and Nova- 
seq PE150, aiming for ~100x coverage in each assembly. We 
performed a benchmarking study using different assemblers. 
The results showed that CLC assembler (QIAGEN, 2022) 
yielded optimal assemblies with regard to Benchmarking 
Universal Single-Copy Orthologs (BUSCO, version 5.5.0) 
(Simao et al., 2015; Manni et al., 2021) scores and contig length 
(e.g., N50 values; Supplemental Figure 1). Hence, the CLC 
assembler was used for the 110 macroalgal assemblies here 
with three different k-mer sizes (40, 50, and 60). The best 
k-mer size for a given assembly was similarly evaluated, and 
the best assembly was used for downstream analyses. The 
macroalgal genomes with the highest N50s were of Ulvella 
leptochaete (chlorophyte, CCAP6037/1, N50 = 370 kbp) and 
Chroothece richteriana (rhodophyte, CCAP1353/4, N50 = 322 
kbp). Phaeophyte macroalgae had higher average N50s 
(Supplemental Figure 1). 


After decontamination, translated CDSs from the newly 
sequenced and reference genomes were analyzed with 
BUSCO to determine genome completeness using the latest 
(v5.5) BUSCO Docker (https://www.docker.com) container 
in Singularity (https://sylabs.io/) with the command “singu- 
larity run busco_v5.5.0_cv1.sif && singularity exec busco_ 
v5.5.0_cv1.sif busco -c 16 -m protein —auto-lineage -i $LINE 
-o "$LINE".busco-out >> "$LINE"-busco-out” (also see 
Figure 1C-1E). The .fastq sequencing read files are available 
at NCBI (BioProject: PRJNA924561). The assemblies are avail- 
able in Supplemental Data 1 (https://doi.org/10.5281/zenodo. 
7758534, https://zenodo.org/record/7758534). Briefly, our 
genomes showed similar size distributions as the published 
macroalgal genomes with Chlorophyta having the smallest 
range, then Rhodophyta, and Ochrophyta had the largest 
genomes. All ranges shown as interquartile (25%-75%) 
ranges: Chlorophyta, 90.11-267.46 Mbp, mean = 261.48 Mbp; 
Ochrophyta, 231.09-414.02 Mbp, mean = 320.27 Mbp; 
Rhodophyta, 143.28-348.50 Mbp, mean = 290.34 Mbp. The 
BUSCO metrics from the new genomes were higher, on 
average, than the published references both overall and within 
each phylum (Supplemental Table 1) (Chlorophyta, difference 
= 26.7 + 18.5, p = 0.14; Ochrophyta, difference = 31.8 + 12.5, 
p = 0.06; Rhodophyta, difference = 10.5 + 5.3, p = 0.04) as 
analyzed with a standardized, Dockerized BUSCO (v5.5) 
instance. 


ORF and contig decontamination 


A small fraction of microbiome community members, endo- 
phytes, and contaminants were impossible to remove and were 
sequenced. After the unsatisfactory application of multiple 
available decontamination software, we developed a workflow 
to identify and remove contaminating CDSs and contigs. 
Contaminant sequences were thus removed from the macroalgal 
genome assemblies using the in-house decontamination pipe- 
line, Basic Local Alignment Search Tool (BLAST)-Limiting Eukary- 
otic Assembly Contamination, Heuristically (BLEACH; see also 
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Supplemental Data 3). Assembled contigs from the sequenced 
macroalgal cultures were filtered according to this protocol to 
remove fungal, bacterial, and archaeal contamination. This 
approach relied on the homology of predicted CDSs to 
contaminants detected with Diamond BLAST (Buchfink et al., 
2015, 2021). The end process removed as much contamination 
as possible and kept the macroalgal sequences by balancing 
two widely used, although heuristic, algorithms. Translated 
ORFs (tORFs) were queried against the full NCBI nr protein 
database with the command “diamond blastp -db nr -query $ 
—out $.diamond.blastp.txt -evalue $B -threads 28 -k 5 —-outfmt 
6 qseqid glen sseqid sallseqid slen qstart qend sstart send 
qseq sseq evalue bitscore score length pident nident mismatch 
positive gapopen gaps ppos dqframe btop stitle salltitles 
qcovhsp qtitle" for the first BLEACH round (i.e., default sensi- 
tivity). The second BLEACH round used the same BLASTP com- 
mand as the first round except for the parameter “—more-sensi- 
tive" was used to further interrogate the sequences. The 
BLEACH rounds three to six further increased diamond BLAST 
sensitivity (e.g., most sensitive, ultra-sensitive) while increasing 
the allowed E value as a balancing mechanism to limit type | 
and Il errors. 


tORFs from the genomes were queried against the 
NCBI nr protein database using an accelerated version of the 
NCBI BLAST (Buchfink et al., 2015, 2021). Queries that had 
their best hit in a bacterial, fungal, or archaeal organism were 
removed in subsequent rounds with incrementally increased 
sensitivity. If a BLASTP search against nr revealed a bacterial 
hit, the corresponding CDS was heuristically labeled as a 
contaminant. Contigs sourcing contaminant CDSs were removed 
from the original assemblies. Reference macroalgal genomes, 
including Ectocarpus siliculosus, Ectocarpus subulatus, Saccha- 
rina japonica, Chondrus crispus, Gracilariopsis chorda, Porphyra 
umbilicalis, Neopyropia yezoensis, Neoporphyra haitanensis, 
Gracilariopsis lemaneiformis, Kappaphycus alvarezii, Caulerpa 
lentillifera, Ulva mutabilis, Ulva prolifera, and Chara braunii had 
few contaminant sequences (Supplemental Figure 2). After 
decontamination, the newly sequenced genomes had similar 
PFAM distributions to the reference genomes. After six rounds 
of BLEACH decontamination at incrementally increasing sensi- 
tivity, the cleaned genomes were used for downstream analyses 
(see also Supplemental Figure 2). 


To control for over-cleaning, we examined the effect of BLEACH 
on assemblies that were not contaminated, including Ectocarpus 
subulatus and Chlamydomonas reinhardtii CC-1883 (Supplemental 
Data 3). The BLEACH process did not adversely affect these ge- 
nomes, as 99.6596 and 98.8796 of contigs in each were retained. 
To validate our contig removal process, we mapped our original 
sequencing reads to the contaminant ORFs, extracted aligned 
reads, and mapped these reads to the clean contigs. Most reads 
from each assembly were marked as contaminants, and only 
2.72% of the extracted reads mapped back to the clean contigs, 
indicating low rates of contaminants in the resultant filtered macro- 
algal genomes. We validated our decontamination pipeline by 
downsampling each tORFeome to 2000 sequences and perform- 
ing NCBI BLASTP with the same parameters as the Diamond 
BLASTP. Estimations of percentage contamination are shown in 
Supplemental Figure 2 and listed in full in Supplemental Data 2 
for each iteration of the BLEACH program. 
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Phylogenies and molecular-clock divergence estimates 


Phylogenies of the sequenced species (Figure 1A) were 
reconstructed using PhyloT (https://phylot.biobyte.de/index.cgi) 
and the Interactive Tree of Life (https://itol.embl.de). The 
macroalgal species names from Supplemental Table 1 were 
used as queries to build the model based on established NCBI 
taxonomy. 


Evolutionary timelines were reconstructed by averaging diver- 
gence estimates from multiple molecular-clock analyses (Hedges 
et al., 2004; Yoon et al., 2004; Zimmer et al., 2007; Parfrey et al., 
2011; Yang et al., 2016) in timetree.org (Supplemental Figure 1). 
This software uses an extensive public database that unites 
thousands of published studies to form an interactive, time- 
scaled Tree of Life. The chronologies shown in Supplemental 
Figure 1 showcase pivotal moments in geological and 
atmospheric history for seamless comparison with evolutionary 
timelines and time-based trees of the relevant algal species. 


Comparative genomics references 


To discern macroalgal-specific gene collections in the three line- 
ages, we performed a variety of comparative genomics using pub- 
lished reference genomes as validation standards. Decontaminated 
genomes and their translated ORFeomes were compared across 
phyla, climate, and habitats. Reference genomes included in our an- 
alyses were the phaeophyte macroalgae Ectocarpus siliculosus 
(model organism, ~196 Mbp) (Cock et al., 2010b), Ectocarpus 
subulatus (242 Mbp) (Dittami et al., 2020), and Saccharina japonica 
(commercial species, ~535 Mbp) (Ye et al., 2015); the rhodophyte 
Chonarus crispus (model organism, carrageenan production, 
~104.98 Mbp) (Collen et al., 2013), Gracilariopsis chorda (92.18 
Mbp) (Lee et al., 2018), Porphyra umbilicalis (commercial species, 
~87.89 Mbp) (Brawley et al., 2017), Neopyropia yezoensis 
(commercially profitable, susabi-nori, ~107.59 Mb (Nakamura 
et al., 2013), Neoporphyra haitanensis (63.25 Mbp) (Cao et al., 
2020), Gracilariopsis lemaneiformis (88.69 Mbp) (Zhou et al., 2013), 
and Kappaphycus alvarezii (commercial species, 336.72 Mbp) (Jia 
et al., 2020); and the Chlorophyta Caulerpa lentillifera (28.77 Mbp) 
(Arimoto et al., 2019), Ulva mutabilis (98.48 Mbp) (De Clerck et al., 
2018), Ulva prolifera (887.88 Mbp) (He et al., 2021), and Chara 
braunii (1751.21 Mbp) (Nishiyama et al., 2018). 


Structural and functional annotation 


We used hidden Markov models to identify CDSs (https:// 
github.com/KorfLab/SNAP), PFAMs (http://nmmer.org/), and 
VFAMs (https://derisilab.ucsf.edu/software/vFam/) from the mac- 
roalgal genomes. A custom Arabidopsis thaliana HMM yielded 
the most and highest-quality gene predictions for most species. 
To remove bias from variable gene calling due to other poor-quality 
HMMs, a “one-size-fits-all” approach was used with the A. thaliana 
HMM to predict CDS and translated peptides in the 124 macroalgal 
genomes. The newly sequenced genomes were predicted to be 
diploid or polyploid. In general, assemblies were haploid unless 
considerable heterozygosity was present. 


The PFAM-A-v31.1 database was used as a query for translated 
ORFs. The command was “hmmsearch -noali -E 0.0001 -cpu 
28 -domtblout $OUT $IN.aa.fa”. All reported values in the 
columns in the output files are described in the HMMer user guide 
(http://eddylab.org/software/hmmer/Userguide.pdf). These include 
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sequence coordinates for alignment and matches with HMM 
models, E values, percentage identity, and bias. BLAST (https:// 
blast.ncbi.nlm.nih.gov/Blast.cg) was used to find protein 
homology with the accelerated Diamond BLAST algorithm 
(Buchfink et al., 2015, 2021) for annotation and downstream 
decontamination. The Diamond BLASTP | command 
was “diamond blastp -db nr -query $assembly.aa.fa -out 
$assembly.diamond.blastp.txt -more-sensitive -threads 4 -k 1 — 
outfmt 6 qseqid qlen sseqid sallseqid slen qstart qend sstart 
send qseq sseq evalue bitscore score length pident nident 
mismatch positive gapopen gaps ppos qframe btop stitle 
salltitles qcovhsp qtitle." 


Enzyme Commission (EC) codes were obtained using PFAM asso- 
ciations described using the EC domain-miner tool (Alborzi et al., 
2017) The association list is a downloadable file, and EC 
designations were obtained with the command: “while read |; do 
grep -w $I $PFAM.txt > $EC.out.” Kyoto Encyclopedia of 
Genes and Genomes (KEGG) designations were obtained 
from EC IDs using EC2KEGG (Porollo, 2014). The EC codes were 
used to retrieve KEGG terms using EC2KEGG (Porollo, 2014). 
Briefly, the scripts in this package use “libwww-perl” (for Internet 
communication with KEGG, $hitp://search.cpan.org/$mschilli/ 
libwww-perl-6.06/lib/LWP.pm), “Text-NSP” (for computing the 
Fisher exact test, $http://search.cpan.org/$tpederse/Text-NSP- 
1.27/, and “Statistics-Multtest” (for correcting p values on 
multiple hypotheses testing, $http://search.cpan.org/$jokergoo/ 
Statistics-Multtest-0.13/). Nodes in the iPath diagram 
(Figure 3) represent enzymatic reactions with corresponding EC 
designations; bold edges indicate present ECs. For the iPATH 
analysis, EC codes were used to reconstruct metabolic 
pathways on the iPath website (https://pathways.embl.de/). The 
population of nodes in the reaction maps tightly corresponds 
with estimated origin times for each macroalgal lineage and 
further highlights the evolutionary innovations that multicellular 
algae have made in their respective habitats and genetic 
constraints. 


Ternary analysis 


To elucidate the propensity for PFAM domains to be found in the 
Rhodophyta, Chlorophyta, or Ochrophyta macroalgal genomes 
we surveyed, a ternary analysis was performed. Ternary analyses 
are commonly used in the physical sciences to determine the 
composition of components from three entities (see https:// 
mathworld.wolfram.com/TernaryDiagram.html). Here, the three 
macroalgal lineages were interrogated regarding their ternary dis- 
tribution to yield a high-level picture of the degree of unique and 
shared PFAMs. Ternary analysis was carried out in R using the 
package "ggtern" (http://www.ggtern.com), an extension to 
ggplot2. A normal simplex diagram graphs in k + 1 dimensions, 
whereas a ternary plot graphs in k = 2 with k + 1 = 3 vertices. 
Values for PFAM averages per phylum were used as inputs and 
represented either x, y, or z plots. The threshold of 8096 purity 
was used to define clade-specific PFAMs, and PFAMs above 
this threshold in each of the phyla were used for downstream in 
domain-centric ontology enrichment analyses. 


Domain-centric gene ontology enrichment 


Functional enrichment analyses were carried out on subsets of 
PFAMs as indicated using Domain-centric Gene Ontology 
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(dcGO; Fang and Gough, 2013). Preliminary HMMsearch 
(hmmer.org) was performed on translated ORFs with the 
command “hmmsearch -noali -E 0.0001 -cpu 28 -domtblout 
$OUT $IN.aa.fa.” The software dcGO statistically infers 
associations between GO terms and combinations of PFAM 
domains. Uploaded queries were PFAM lists from selected 
sets in the ternary and Venn analyses. A false discovery rate 
(FDR) threshold of 0.01 was used for detection of enrichment 
and the full set of GO terms was queried. Briefly, distributions 
of GO terms are tested for overrepresentation (enrichment) 
against a collection of all GO terms. The analysis takes the 
hypergeometric distribution as a null-hypothesis statistical tests 
(equivalent to Fisher’s exact tests) infers PFAM to GO associa- 
tions. An overall p value and a corresponding overall hypergeo- 
metric score (Z score) is calculated by the observed PFAM count 
minus the expected PFAM count and divided by the standard 
deviation under the hypergeometric distribution with all available 
UniProt proteins. A secondary p value is calculated using only 
UniProt proteins in GO terms to establish a dual constraint filter 
for significance of matches. The strength of the inference is seen 
in the Z score. 


Comparisons of micro- and macroalgal genomes 


Microalgal tORFeomes were obtained from our previous work 
and HMMsearch was performed similarly to the macroalgal 
tORFeomes. Parsed HMMsearch results (i.e., PFAM matrices) 
were used in comparative analyses. Means were analyzed 
using one-way analysis of variance (ANOVA), where each com- 
parison’s p value, logworth, FDR (or adj. p value), and FDR log- 
worth are reported (Supplemental Table 3). Protein family 
averages were compared across phyla between macroalgal 
and microalgal genome collections and screened for 
significant differences (Supplemental Table 4). The combined 
set included 251 algal genomes in total. The microalgal 
genome collection was composed of unicellular chlorophytes, 
ochrophytes, and rhodophytes sequenced in our previous 
work and the reference microalgal genomes included therein 
for analyses (Nelson et al., 2021). These included 76 
chlorophytes, 44 ochrophytes, and seven rhodophytes. 
HMMsearch was applied on the microalgal tORFs in the same 
manner as the macroalgal genomes (Nelson et al., 2021). The 
combined micro-/macroalgal HMMsearch results were used 
comparative analyses using the entire PFAM set (~11 000 
PFAMs) and reduced sets (e.g., the adhesome set of 110 
PFAMs and the transporter set of 183 PFAMs). 


Pathway analysis 


Pathway discovery was facilitated by ECdomainMiner, a content- 
based filtering approach to automatically infer associations 
between EC codes and PFAM domains (Alborzi et al., 2017). 
Recovered EC codes were used to reconstruct metabolic 
pathways from the KEGG using iPATH3 (Letunic et al., 2008). 
The EC list can be uploaded to the iPATH3 website (https:// 
pathways.embl.de) for an interactive exploration of the reactions 
and compounds comprising these macroalgal-specific path- 
ways. Metabolic pathways were reconstructed using the indi- 
cated gene lists in the Interactive Pathways Explorer (iPath) (a 
web-based tool for the visualization, analysis, and customization 
of metabolic and other known pathway maps; https://pathways. 
embl.de/). EC codes were derived from PFAMs using the 
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ECDomainMiner software (Alborzi et al., 2017). This tool 
scavenges the protein data bank (PDB) (Bittrich et al., 2021; 
Flores et al., 2021; Behzadi and Gajdacs, 2022; Goodsell and 
Burley, 2022) to discover X-ray crystallography or NMR 
spectroscopy-resolved PDB entries corresponding to PFAM do- 
mains and classify them into EC codes. Only GOLD-ranked EC 
links were used. 


Adhesome analysis 


We defined the macroalgal adhesome as including cadherin, in- 
tegrin, armadillo (ARM), lectin, F actin-binding CH, Kringle, PH, 
SH2, and LIM domains (n = 110 distinct domains). The acces- 
sions for the queried adhesome domains are in Supplemental 
Table 4. The cadherin PFAMs were queried were PF00028, 
PF01049, PF08266, PF08374, PF08758, PF12733, PF15974, 
PF16184, PF16492, PF17756, PF17803, PF17812, PF17813, 
PF17892, and PF18432. PFAM subsets were compared among 
phyla and native climate of the surveyed macroalgae (n Ochro- 
phyta = 42, n Rhodophyta = 67, n Chlorophyta = 11). Means 
were analyzed using ANOVA, where each comparison’s p value, 
logworth, FDR or adj. p value, and FDR logworth are reported in 
Supplemental Table 4. 


Predictive modeling with neural networks 


Predictive modeling using neural networks was carried out in the 
JMP Neural Networks module. Four sequential models were used 
to fit eight TanH, four linear, and one Gaussian formulae with a 
learning rate of 1.0. The input datasets were PFAM calls from 
the micro- and macroalgal genomes. Python, C, and Java code 
for the formulae necessary to reproduce the dNN are provided 
in the supplemental information. Deployment was on an high-per- 
formance computing (HPC) cluster with a LINUX CentOS oper- 
ating system. Jobs were dispatched with SLURM; SLURM logs 
for programs run in this project are in the supplemental informa- 
tion for their corresponding analyses. 


Viral family and EsV-1-7 (co)domain analyses 


Hidden Markov models were used to identify VFAMs in the mac- 
roalgal ORFs as in Skewes-Cox et al. (2014) (https://derisilab. 
ucsf.edu/software/vFam/). Detected VFAMs were compared 
across phyla, habitat, and climate. As EsV-1-7 is a PFAM, not 
VFAM, domain, sequences containing EsV-1-7 domains were ex- 
tracted from the PFAM HMMsearch results. Additional PFAMs on 
the EsV-1-7 sequences (i.e., codomains) were extracted in the 
same manner from the HMMsearch results (Supplemental Data 
3, PFAM HMMSEARCH) and used for statistical comparative an- 
alyses by phyla, habitat, and climate. 


Statistical analyses 


One-way ANOVA statistical analyses (Blanca et al., 2017) were 
performed for comparisons of PFAMs, VFAMs, and their subsets. 
Batch t-tests comparing VFAM, PFAM, GO, and EVOP subsets 
were performed in the JMP Response Screening platform to 
screen for significant differences with practical effect sizes. See 
Supplemental Tables 1—5 in Supplemental Data 4 for raw and 
processed data frames used in these analyses. Standardized 
residues were used to compare means in >100 000 t-tests 
while controlling for FDR (Benjamini et al., 2001). Robust Huber 
M-estimation (using maximum-likelihood type estimators) (El 
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Karoui et al., 2013) was used to reduce the sensitivity of the 
tests to outliers. Briefly, this method reduces the influence of 
outliers by minimizing their weights for the comparative 
statistics algorithm. The most informative information from 
these statistical tests for the general reader is in the "Response 
Screen Compare Means" tables, which include tests for all 
possible comparisons and results for practical equivalences 
and differences. Comparisons not passing either of these tests 
are labeled as "inconclusive." The practical difference is the 
difference in means having practical interest. Standard 
deviation estimates were computed from their interquartile 
ranges. The estimate was c = (/QR)/(1.3489795). The 
practical difference was computed as 6(c) multiplied by 0.10 as 
a proxy for the practical difference proportion. Practical 
difference p values are given for tests of whether the absolute 
value of the mean difference in Y between comparison levels is 
less than or equal to the practical difference. Low p values 
indicate that the absolute difference exceeds the practical 
difference, indicating that the difference in the comparison is of 
practical significance. 


Practical equivalence p values were computed using the two 
one-sided tests method to test for practical differences between 
means (Schuirmann, 1987). The practical difference specifies 
a threshold difference for which smaller differences are 
considered practically equivalent. One-sided t-tests are con- 
structed for two null hypotheses: the true difference exceeds 
the practical difference; the true difference is less than the nega- 
tive of the practical difference. If both tests reject, this indicates 
that the absolute difference in the means falls within the practical 
difference. Therefore, the groups are considered practically 
equivalent. The practical equivalence p value is the largest 
p value obtained on the one-sided t-tests. Low practical equiva- 
lence p values indicate that the mean response for the top com- 
parison level is equivalent, in a practical sense, to the mean for the 
lower level. 


DATA AND CODE AVAILABILITY 


The .fastq sequencing read files are available at the NCBI 
sequence read archive (BioProject: PRJNA924561). The 
data and code used in data processing and analyses are 
available online in Zenodo repositories as described in 
the supplemental information: i.e., https://doi.org/10.5281/ 
zenodo.7758534 (Supplemental Data 1,  https://zenodo. 
org/record/7758534), https://doi.org/10.5281/zenodo.7751045 
(Supplemental Data 2, https://zenodo.org/record/7751045), 
https://doi.org/10.5281/zenodo.7947301 (Supplemental Data 
3, https://zenodo.org/record/7947301), and https://doi.org/10. 
5281/zenodo.10581453 (Supplemental Data 4, https://zenodo. 
org/records/10581 453). 
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